Polish Morphological Guesser Based on a Statistical A Tergo Index

Authors

  • Maciej Piasecki
  • Adam Radziszewski
Abstract

We present a direct method for constructing a morphosyntactic guesser for Polish, that is, a program that produces morphosyntactic descriptions for word forms unknown to the morphological analyser. The core of the method is the construction of a statistical a tergo index, in which pseudo-suffixes (endings) extracted by a statistical tree define the morphosyntactic properties of the corresponding word forms. A secondary aim was to investigate to what extent morphological analysis can be developed exclusively on the basis of endings. Experiments in extracting a guesser for a specific domain of texts are also presented. The method can be applied to any other inflectional language with only minor technical changes.
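To make the idea of an a tergo (reverse-ordered) index concrete, here is a minimal sketch in Python. It is not the authors' implementation: the abstract says the endings are extracted by a statistical tree, whereas this sketch swaps in a plain reversed-character trie with a count threshold. The class name `ATergoIndex`, the `min_count` parameter, and the simplified tag strings are all assumptions made for illustration.

```python
from collections import Counter


class ATergoIndex:
    """Toy reverse-suffix index: maps word endings to tag counts.

    Hypothetical illustration only; the count threshold stands in for
    the statistical tree mentioned in the abstract.
    """

    def __init__(self, min_count=2):
        # An ending is trusted only if at least min_count known forms share it.
        self.min_count = min_count
        self.root = {"tags": Counter(), "children": {}}

    def add(self, word, tag):
        """Index a known word form, reading it from its last character."""
        node = self.root
        node["tags"][tag] += 1
        for ch in reversed(word):
            node = node["children"].setdefault(
                ch, {"tags": Counter(), "children": {}}
            )
            node["tags"][tag] += 1

    def guess(self, word):
        """Return (longest trusted ending, tag -> relative frequency)."""
        node, best, best_len = self.root, self.root, 0
        for depth, ch in enumerate(reversed(word), start=1):
            node = node["children"].get(ch)
            if node is None:
                break
            if sum(node["tags"].values()) >= self.min_count:
                best, best_len = node, depth
        total = sum(best["tags"].values())
        ending = word[len(word) - best_len:]
        if total == 0:
            return ending, {}
        return ending, {t: c / total for t, c in best["tags"].items()}


# Toy training data with simplified, made-up tag strings.
index = ATergoIndex(min_count=2)
for form, tag in [
    ("kotami", "subst:pl:inst"),
    ("psami", "subst:pl:inst"),
    ("czytałem", "praet:sg:m1"),
    ("pisałem", "praet:sg:m1"),
]:
    index.add(form, tag)

ending, tags = index.guess("smartfonami")
print(ending, tags)
```

In this sketch an unknown form is matched against the longest ending shared by at least `min_count` known forms, and the tag distribution stored at that ending is returned as the guess. A form with no matching ending falls back to the overall tag distribution at the root.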


Similar resources

Towards Czech Morphological Guesser

This paper presents a morphological guesser for Czech based on data from Czech morphological analyzer ajka [1]. The idea behind the presented concept lies in a presumption that the new (and therefore unknown to the analyzer) words in a language behave quite regularly and that a description of this regular behaviour can be extracted from the existing data of the morphological analyzer. The paper...


Comparing a Statistical and a Constraint-Based Method

In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems an...


Handling Unknown Words in Arabic FST Morphology

A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...


Describing Linde’s Dictionary of Polish for Digitalisation Purposes

The present paper describes the attempts at digitalising the so called Linde’s dictionary of Polish published in 6 volumes between 1807 and 1814 by Samuel Bogumił Linde. We are working on a formal description of the dictionary’s structure, whose purpose will be to allow programmers to design a tool for automatic tagging of the text. The dictionary is multilingual, so performing OCR with good qu...


Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing

Highly inflectional/agglutinative languages like Hungarian typically feature possible word forms in such a magnitude that automatic methods that provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all wordforms all...




Publication date: 2007